Results 1 - 4 of 4
1.
1st International Conference on Ambient Intelligence in Health Care, ICAIHC 2021 ; 317:459-468, 2023.
Article in English | Scopus | ID: covidwho-2173926

ABSTRACT

During the COVID-19 pandemic, several genetic mutations occurred in the SARS-CoV-2 virus, making it more infectious or transmissible. The World Health Organization (WHO) tracks and classifies variants as variants of concern (VOCs) or variants of interest (VOIs), depending on the level of transmissibility and dominance of the variant in the regions. The classification and identification of variants usually occur through sequence alignment techniques, which are computationally complex, making them infeasible for classifying thousands of sequences simultaneously. In this work, an application of the alignment-free method BASiNETEntropy is proposed for the classification of the variants of concern of SARS-CoV-2. The method initially maps the biological sequences as a complex network. From this network, the most informative edges are selected through the entropy maximization principle, yielding a filtered network containing only those edges. Complex network topological measurements are then extracted and used as feature vectors in the classification process. Sequences of SARS-CoV-2 variants of concern extracted from NCBI were used to assess the method. Experimental results show that the extracted features can classify the variants of concern with high accuracy using few features, contributing to the reduction of the feature space. Besides classifying the variants of concern, unique patterns (motifs) were also extracted for each variant, relative to the SARS-CoV-2 reference sequence. The proposed method is implemented as open source in the R language and is freely available at https://cran.r-project.org/web/packages/BASiNETEntropy/. © 2023, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
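
The pipeline described in this abstract (sequence to network, entropy-based edge filtering, topological features) can be illustrated with a minimal Python sketch. This is not the BASiNETEntropy implementation, which is an R package. The word length, the function names, and the rule of keeping the edges with the largest entropy contribution are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter
from math import log2


def sequence_to_edges(seq, word=3, step=1):
    """Count edges between consecutive overlapping words of a sequence.

    Nodes are words of length `word`; an edge links each word to the next one.
    """
    words = [seq[i:i + word] for i in range(0, len(seq) - word + 1, step)]
    return Counter(zip(words, words[1:]))


def edge_entropy_terms(edges):
    """Per-edge contribution -p * log2(p) to the Shannon entropy of edge frequencies."""
    total = sum(edges.values())
    return {e: -(c / total) * log2(c / total) for e, c in edges.items()}


def filter_edges(edges, keep):
    """Keep the `keep` edges with the largest entropy contribution.

    Assumption: this ranking is a stand-in for the entropy maximization
    step named in the abstract, not the published selection rule.
    """
    terms = edge_entropy_terms(edges)
    top = sorted(terms, key=lambda e: (-terms[e], e))[:keep]
    return {e: edges[e] for e in top}


def degree_features(edges):
    """Simple topological measurements of the (filtered) network."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    n = len(degree)
    return {
        "nodes": n,
        "edges": len(edges),
        "mean_degree": sum(degree.values()) / n if n else 0.0,
        "max_degree": max(degree.values(), default=0),
    }


if __name__ == "__main__":
    edges = sequence_to_edges("ACGTACGTAC", word=3)
    print(degree_features(filter_edges(edges, keep=2)))
```

The resulting feature dictionary could be fed to any standard classifier; the measurements actually used by the package may differ from the degree statistics shown here.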

2.
9th International Conference on Mining Intelligence and Knowledge Exploration, MIKE 2021 ; 13119 LNAI:161-173, 2022.
Article in English | Scopus | ID: covidwho-2173807

ABSTRACT

Biological sequence analysis involves the study of structural characteristics and chemical composition of a sequence. From a computational perspective, the goal is to represent sequences using vectors which bring out the essential features of the virus and enable efficient classification. Methods such as one-hot encoding, Word2Vec models, etc. have been explored for embedding sequences into the Euclidean plane. However, these methods either fail to capture similarity information between k-mers or face the challenge of handling Out-of-Vocabulary (OOV) k-mers. In order to overcome these challenges, in this paper we aim to explore the possibility of embedding biosequences of MERS, SARS and SARS-CoV-2 using the Global Vectors (GloVe) model and the FastText n-gram representation. We conduct an extensive study to evaluate their performance using classical Machine Learning algorithms and Deep Learning methods. We compare our results with dna2vec, which is an existing Word2Vec approach. Experimental results show that FastText n-gram based sequence embeddings enable deeper insights into understanding the composition of each virus and thus give a classification accuracy close to 1. We also provide a study regarding the patterns in the viruses and support our results using various visualization techniques. © 2022, Springer Nature Switzerland AG.
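
The OOV problem mentioned in this abstract can be made concrete with a small Python sketch of k-mer tokenization and character n-gram decomposition, the idea behind subword models such as FastText. It does not use the FastText library or reproduce the authors' pipeline; the function names, the values of k and n, and the Jaccard overlap measure are illustrative assumptions.

```python
def kmers(seq, k=6, stride=1):
    """Split a sequence into overlapping k-mers (the 'words' of the sequence)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def char_ngrams(token, n_min=3, n_max=4):
    """Character n-grams of a token, with '<' and '>' as boundary markers."""
    padded = f"<{token}>"
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


def ngram_overlap(a, b, n_min=3, n_max=4):
    """Jaccard overlap between the n-gram sets of two tokens."""
    ga, gb = char_ngrams(a, n_min, n_max), char_ngrams(b, n_min, n_max)
    return len(ga & gb) / len(ga | gb)


if __name__ == "__main__":
    vocabulary = set(kmers("ACGTACGTAC", k=4))
    unseen = "ACGA"  # hypothetical k-mer absent from the vocabulary
    print(unseen in vocabulary)                    # whole-word lookup fails
    print(ngram_overlap("ACGT", unseen, 3, 3))     # but subwords are shared
```

A whole-word model has no vector for an unseen k-mer, whereas a subword model can compose one from the n-grams it shares with known k-mers. The real FastText model additionally hashes n-grams into buckets and learns their vectors, which this sketch omits.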

3.
Brief Bioinform ; 23(3)2022 05 13.
Article in English | MEDLINE | ID: covidwho-1795369

ABSTRACT

Predicting protein properties from amino acid sequences is an important problem in biology and pharmacology. Protein-protein interactions among SARS-CoV-2 spike protein, human receptors and antibodies are key determinants of the potency of this virus and its ability to evade the human immune response. As a rapidly evolving virus, SARS-CoV-2 has already developed into many variants with considerable variation in virulence among them. Utilizing the proteomic data of SARS-CoV-2 to predict its viral characteristics will, therefore, greatly aid in disease control and prevention. In this paper, we review and compare recent successful prediction methods based on long short-term memory (LSTM), transformer, convolutional neural network (CNN) and a similarity-based topological regression (TR) model, and offer recommendations about appropriate predictive methodology depending on the similarity between training and test datasets. We compare the effectiveness of these models in predicting the binding affinity and expression of SARS-CoV-2 spike protein sequences. We also explore how effective these predictive methods are when trained on laboratory-created data and tasked with predicting the binding affinity of in-the-wild SARS-CoV-2 spike protein sequences obtained from the GISAID datasets. We observe that TR is a better method when the sample size is small and test protein sequences are sufficiently similar to the training sequences. However, when the training sample size is sufficiently large and prediction requires extrapolation, LSTM embedding and CNN-based predictive models show superior performance.


Subject(s)
COVID-19 , SARS-CoV-2 , Amino Acid Sequence , COVID-19/genetics , Humans , Protein Binding , Proteomics , SARS-CoV-2/genetics , Sequence Analysis, Protein , Spike Glycoprotein, Coronavirus/metabolism

4.
Brief Funct Genomics ; 20(3): 181-195, 2021 06 09.
Article in English | MEDLINE | ID: covidwho-1246686

ABSTRACT

With the development of high-throughput sequencing technology, biological sequence data reflecting life information have become increasingly accessible. Particularly against the background of the COVID-19 pandemic, biological sequence data play an important role in detecting diseases, analyzing mechanisms and discovering specific drugs. In recent years, pretraining models that have emerged in natural language processing have attracted widespread attention in many research fields, not only because they decrease training cost but also because they improve performance on downstream tasks. Pretraining models are used for embedding biological sequences and extracting features from large biological sequence corpora to comprehensively understand biological sequence data. In this survey, we provide a broad review of pretraining models for biological sequence data. We first introduce biological sequences and corresponding datasets, including brief descriptions and accessible links. Subsequently, we systematically summarize popular pretraining models for biological sequences based on four categories: CNN, word2vec, LSTM and Transformer. Then, we present some applications of the proposed pretraining models on downstream tasks to explain the role of pretraining models. Next, we provide a novel pretraining scheme for protein sequences and a multitask benchmark for protein pretraining models. Finally, we discuss the challenges and future directions in pretraining models for biological sequences.


Subject(s)
Algorithms , Computational Biology/methods , Data Mining/methods , High-Throughput Nucleotide Sequencing/methods , Natural Language Processing , Software , Datasets as Topic , Deep Learning , Humans , Models, Theoretical